As an analyst, you will come across time series data - probably often. Knowing some tricks for munging (cleaning/working with) and visualizing time series data will serve you well.
This tutorial will cover the general topics you need to know to be able to work with and visualize time series data. All code will be explained to give you a good understanding of what we are actually doing.
The data for this tutorial comes from NOAA (National Oceanic and Atmospheric Administration). It contains daily snowfall totals for Central Park. Reading in the data:
library(data.table)
snow <- fread('~/Downloads/noaa_snow_ny.csv')
head(snow)
## NAME DATE SNOW
## 1: NY CITY CENTRAL PARK, NY US 2000-01-20 2.5
## 2: NY CITY CENTRAL PARK, NY US 2000-01-25 5.5
## 3: NY CITY CENTRAL PARK, NY US 2000-01-30 1.5
## 4: NY CITY CENTRAL PARK, NY US 2000-02-03 2.2
## 5: NY CITY CENTRAL PARK, NY US 2000-02-18 3.0
## 6: NY CITY CENTRAL PARK, NY US 2000-03-17 0.4
We have a column for name (which we can disregard), a column for date, and a column with the amount of snow that fell that day (in inches). For now, let’s drop the name column, rename the other columns to lowercase, and move forward:
snow <- snow[,c('DATE','SNOW')]
colnames(snow) <- c('date','snow')
head(snow)
## date snow
## 1: 2000-01-20 2.5
## 2: 2000-01-25 5.5
## 3: 2000-01-30 1.5
## 4: 2000-02-03 2.2
## 5: 2000-02-18 3.0
## 6: 2000-03-17 0.4
Aside from a few specialized software packages built for time series analysis, most scripting languages require some finesse to get dates working correctly. Here is a quick run-down of how to work with dates in R.
Let’s look at the date column of the snow data:
head(snow)
## date snow
## 1: 2000-01-20 2.5
## 2: 2000-01-25 5.5
## 3: 2000-01-30 1.5
## 4: 2000-02-03 2.2
## 5: 2000-02-18 3.0
## 6: 2000-03-17 0.4
Those look like dates, but they aren’t:
class(snow$date)
## [1] "character"
R thinks of these dates as characters. This is a defense mechanism that R employs to make sure you can read in data of any type. R doesn’t assume your data is actually a date unless you specifically tell it to. This should be familiar - Excel does the same thing.
Fortunately, just like Excel, R has a date class that you can convert to:
snow$newDate <- as.Date(snow$date)
head(snow)
## date snow newDate
## 1: 2000-01-20 2.5 2000-01-20
## 2: 2000-01-25 5.5 2000-01-25
## 3: 2000-01-30 1.5 2000-01-30
## 4: 2000-02-03 2.2 2000-02-03
## 5: 2000-02-18 3.0 2000-02-18
## 6: 2000-03-17 0.4 2000-03-17
As you can see, the data doesn’t look different, but R now knows that we want to treat the dates as dates. Why does this matter? Here’s an example:
max(snow$date) - min(snow$date)
## Error in max(snow$date) - min(snow$date): non-numeric argument to binary operator
max(snow$newDate) - min(snow$newDate)
## Time difference of 6647 days
The first attempt fails because R doesn’t know how to subtract characters from each other. R does, however, know how to subtract dates.
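The same idea extends to dates that do not arrive in year-month-day order. Here is a minimal sketch, using made-up date strings rather than the NOAA file, of how as.Date() can be told the layout of the text and what arithmetic then becomes available:

```r
# Hypothetical date strings in month/day/year order (not from the NOAA data)
raw <- c("01/20/2000", "01/25/2000", "03/17/2000")

# The format string describes how the text is laid out
parsed <- as.Date(raw, format = "%m/%d/%Y")
class(parsed)

# Gaps between consecutive dates, as a difftime measured in days
gaps <- diff(parsed)

# Total span as a plain number of days
span_days <- as.numeric(max(parsed) - min(parsed), units = "days")

# Shift every date forward by one week
one_week_later <- parsed + 7

# Turn dates back into text, here as year-month labels
month_labels <- format(parsed, "%Y-%m")
```

The variable names here are only for illustration. The point is that once the column is a Date, differences, shifts, and reformatting all work without any string manipulation.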
This is just a small example to show you that formatting your data as a date is very important when working with time series data. Let’s overwrite the original date column with the converted values and drop the temporary column, so that we only have one date column that contains properly formatted data:
snow$date <- as.Date(snow$date)
snow$newDate <- NULL
head(snow)
## date snow
## 1: 2000-01-20 2.5
## 2: 2000-01-25 5.5
## 3: 2000-01-30 1.5
## 4: 2000-02-03 2.2
## 5: 2000-02-18 3.0
## 6: 2000-03-17 0.4
Time series data often arrives in a data frame with a row for each observation. While the data looks clean, we need to take a closer look:
snow[1:10,]
## date snow
## 1: 2000-01-20 2.5
## 2: 2000-01-25 5.5
## 3: 2000-01-30 1.5
## 4: 2000-02-03 2.2
## 5: 2000-02-18 3.0
## 6: 2000-03-17 0.4
## 7: 2000-04-09 1.2
## 8: 2000-12-08 0.1
## 9: 2000-12-20 0.5
## 10: 2000-12-22 0.8
This data set only contains data when there is snow on a given day. The days that don’t have snow are simply left out of the data. If you attempt to visualize this data, you might get something like this:
barplot(snow$snow, names.arg = snow$date)
This is definitely a representation of our data, but it is very misleading. The x axis is supposed to represent dates, but it is only showing the dates where there was snow. If we just want a visualization, we can solve the problem by using ggplot:
library(ggplot2)
ggplot(data = snow, aes(x = date, y = snow)) + geom_bar(stat = 'identity')
This is a much more realistic representation of our data. It only snows a few days a year in NY - this is clear from looking at this visualization.
In the above example, ggplot places each bar at its actual position on a continuous date axis, so the gaps in the data show up as empty space. Some packages are not so friendly, and you don’t want to leave things up to chance, so you will want to reformat your time series data to include the missing days explicitly. The easiest way to do this is with a tidyr function called complete():
library(tidyr)
snow <- complete(snow, date = seq.Date(min(date), max(date), by="day"))
snow$snow[is.na(snow$snow)] <- 0
print(snow)
## # A tibble: 6,648 x 2
## date snow
## <date> <dbl>
## 1 2000-01-20 2.5
## 2 2000-01-21 0
## 3 2000-01-22 0
## 4 2000-01-23 0
## 5 2000-01-24 0
## 6 2000-01-25 5.5
## 7 2000-01-26 0
## 8 2000-01-27 0
## 9 2000-01-28 0
## 10 2000-01-29 0
## # … with 6,638 more rows
OK, now our data represents a full daily time series.
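If you would rather not depend on tidyr, the same gap filling can be sketched in base R by merging against a full sequence of days. The toy data frame below is made up for illustration and only mimics the shape of the snow data:

```r
# Toy data (made up): three snowy days with gaps between them
toy <- data.frame(
  date = as.Date(c("2000-01-20", "2000-01-23", "2000-01-25")),
  snow = c(2.5, 1.0, 5.5)
)

# Every calendar day between the first and last observation
all_days <- data.frame(date = seq(min(toy$date), max(toy$date), by = "day"))

# Left join: keep every day, attach snowfall where we have it
filled <- merge(all_days, toy, by = "date", all.x = TRUE)

# Days that were missing from the original data had no snow
filled$snow[is.na(filled$snow)] <- 0
filled
```

Replacing NA with 0 is only appropriate because a missing row here means "no snow". If missing rows could also mean "no measurement", it would be safer to leave them as NA.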
In R, we like to work with data frames because they are almost universally accepted by various functions and are familiar and easy to comprehend. When working with time series data, it can be useful to use another data type called xts (extensible time series). This data type is slightly less intuitive than a data frame, but it has some very useful properties.
We can take our data frame and create a new object using xts:
library(xts)
ts <- xts(snow$snow, order.by=snow$date)
head(ts)
## [,1]
## 2000-01-20 2.5
## 2000-01-21 0.0
## 2000-01-22 0.0
## 2000-01-23 0.0
## 2000-01-24 0.0
## 2000-01-25 5.5
This doesn’t look too different from our data frame, but R now considers this to be a different class:
class(ts)
## [1] "xts" "zoo"
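As a small illustration of what the xts class adds, rows can be selected with date strings instead of logical comparisons. This is a sketch on a made-up daily series (the object names are hypothetical), not on the snow data:

```r
library(xts)

# Made-up daily series covering the first three months of 2000
days <- seq(as.Date("2000-01-01"), as.Date("2000-03-31"), by = "day")
toy_xts <- xts(seq_along(days), order.by = days)

# All of February 2000
feb <- toy_xts["2000-02"]

# A closed date range, written as start/end
mid_winter <- toy_xts["2000-01-15/2000-02-15"]

# Everything from March 1st onward (open-ended range)
from_march <- toy_xts["2000-03-01/"]

nrow(feb)
nrow(mid_winter)
nrow(from_march)
```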
These time series objects are very useful for a number of reasons. One typical operation you will want to do with time series data is to aggregate the data to different time steps. This can be a big pain without using xts. In xts, we can use the apply.xxx family of functions:
monthly <- apply.monthly(ts, sum)
head(monthly)
## [,1]
## 2000-01-31 9.5
## 2000-02-29 5.2
## 2000-03-31 0.4
## 2000-04-30 1.2
## 2000-05-31 0.0
## 2000-06-30 0.0
Now we have data that is aggregated to the month scale. And similarly, we can translate to the annual scale:
annual <- apply.yearly(ts, sum)
head(annual)
## [,1]
## 2000-12-31 29.7
## 2001-12-31 21.6
## 2002-12-31 14.5
## 2003-12-31 58.1
## 2004-12-31 25.8
## 2005-12-31 47.7
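For comparison, an annual total can also be sketched in base R by extracting the year as a grouping key. The data below is made up for illustration:

```r
# Toy data (made up): a few snowy days across two years
toy <- data.frame(
  date = as.Date(c("2000-01-20", "2000-12-08", "2001-01-05", "2001-02-10")),
  snow = c(2.5, 0.1, 3.0, 1.5)
)

# Pull the year out of each date to use as a grouping key
toy$year <- format(toy$date, "%Y")

# Sum snowfall within each year
annual_base <- aggregate(snow ~ year, data = toy, FUN = sum)
annual_base
```

This works because a year (like a month) can be read straight off the date with format().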
That might seem somewhat easy to do without xts, but consider moving to a weekly scale:
weekly <- apply.weekly(ts, sum)
head(weekly)
## [,1]
## 2000-01-23 2.5
## 2000-01-30 7.0
## 2000-02-06 2.2
## 2000-02-13 0.0
## 2000-02-20 3.0
## 2000-02-27 0.0
This aggregation would have been very difficult without xts.
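To make the aggregation less of a black box: as far as I understand the xts source, the apply.xxx helpers are thin wrappers that find the last row of each period with endpoints() and then apply a function to each chunk with period.apply(). A minimal sketch on a made-up series of ones, so that each sum is simply a count of days:

```r
library(xts)

# Made-up daily series of ones covering January through March 2000
days <- seq(as.Date("2000-01-01"), as.Date("2000-03-31"), by = "day")
toy_xts <- xts(rep(1, length(days)), order.by = days)

# Row positions where each month ends (the vector starts with 0)
ep <- endpoints(toy_xts, on = "months")
ep

# Sum within each month by hand, and with the helper
by_hand <- period.apply(toy_xts, INDEX = ep, FUN = sum)
by_helper <- apply.monthly(toy_xts, sum)

# The same machinery handles steps with no dedicated helper,
# for example two-week periods (k = 2)
biweekly <- period.apply(toy_xts, endpoints(toy_xts, on = "weeks", k = 2), sum)
```

Because the periods are defined by endpoints(), any step it supports can be aggregated the same way.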
Finally, you can easily translate back from a time series object to a data frame using:
annualDF <- fortify.zoo(annual)
colnames(annualDF) <- c('date','snow')
head(annualDF)
## date snow
## 1 2000-12-31 29.7
## 2 2001-12-31 21.6
## 3 2002-12-31 14.5
## 4 2003-12-31 58.1
## 5 2004-12-31 25.8
## 6 2005-12-31 47.7
Moving between time steps easily is critical to good time series analysis. It is easy to miss a pattern if you are looking at the wrong scale. For example, temperature in NY is strongly seasonal, but that seasonality only shows up at a monthly (or finer) scale. If you look at average annual temperatures, the seasonal swing averages out and there is (almost) no pattern left in the data.
With weather, this is obvious because we live with weather every day. However, you may not understand your data this well in practice. The best idea is to try a few different aggregations when working with time series data to see if you can find anything interesting.
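The temperature example can be sketched with synthetic numbers. The series below is not real weather data; it is a made-up sine wave around 55 degrees, chosen so that the effect of the aggregation scale is easy to see:

```r
# Synthetic monthly "temperatures" for three years (made up, not NOAA data):
# a 55-degree baseline with a 20-degree seasonal swing peaking in July
months <- seq(as.Date("2000-01-01"), by = "month", length.out = 36)
month_num <- as.numeric(format(months, "%m"))
toy <- data.frame(
  date = months,
  temp = 55 + 20 * sin(2 * pi * (month_num - 4) / 12)
)

# Average by calendar month: the seasonal pattern is obvious
by_month <- tapply(toy$temp, format(toy$date, "%m"), mean)
range(by_month)

# Average by year: the seasonal swing cancels out completely
by_year <- tapply(toy$temp, format(toy$date, "%Y"), mean)
range(by_year)
```

Real data will not cancel this cleanly, but the same comparison of ranges across scales is a quick check for where the structure in a series lives.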
Another popular method for dealing with time series data is decomposition. The idea is to split out the seasonal oscillation in data, leaving you with a “true” trend line.
The easiest way to do this in R is to convert to a ts object and use the decompose() function:
monthlyDF <- fortify.zoo(monthly)
colnames(monthlyDF) <- c('date','snow')
mo_ts <- ts(monthlyDF$snow, frequency = 12, start = c(2000, 1))
decomp <- decompose(mo_ts)
plot(decomp)
The trend appears to show some change over time, but it doesn’t look like we are systematically getting more or less snow over time.
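To see what decompose() actually returns, here is a sketch on a synthetic monthly series (a made-up linear trend plus a sine wave, not the snow data). With the default additive model, the three components sum back to the original series wherever the trend is defined:

```r
# Synthetic monthly series (made up): slow upward trend plus a yearly cycle
idx <- 1:48
x <- ts(0.1 * idx + 5 * sin(2 * pi * idx / 12), frequency = 12, start = c(2000, 1))

parts <- decompose(x)  # additive model by default
names(parts)

# The trend is a centered moving average, so it is NA at both ends
sum(is.na(parts$trend))

# One seasonal effect per month of the year
length(parts$figure)

# Additive decomposition: series = trend + seasonal + random
recon <- parts$trend + parts$seasonal + parts$random
ok <- !is.na(recon)
all.equal(as.numeric(recon[ok]), as.numeric(x[ok]))
```

The "random" component is simply whatever is left after removing the trend and the seasonal effects, so it is worth plotting on its own to check that no obvious pattern remains in it.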
Why are we using bar charts? Because we’re using discrete time steps (as opposed to continuous time).
You will often see data visualized like this:
library(lubridate)
df <- annualDF[1:(nrow(annualDF)-1),]
df$year <- year(df$date)
ggplot(data = df, aes(x = year, y = snow)) + geom_line() + geom_point() +
scale_x_continuous(breaks = df$year) + ylim(0,60)
This is a common but bad representation of our data. The problem is that we are using a continuous line to represent a discrete event. For example, this plot suggests that it snowed about 5 inches between June and August of 2000, which is obviously not true. Things like this are easy to see when it comes to snowfall, but could potentially be a big problem when you don’t fully understand your data.
This might seem like a weird thing to mention since you see data like this all the time, but you should be thinking hard about the way that you are representing data. I will argue that this is a much better representation of the data:
ggplot(data = df, aes(x = year, y = snow)) + geom_bar(stat = 'identity') +
scale_x_continuous(breaks = df$year) + ylim(0,60)
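As a side note, ggplot2 also provides geom_col(), which is shorthand for geom_bar(stat = 'identity'). A small sketch using the first six annual totals printed earlier (the data frame name toy is just for illustration):

```r
library(ggplot2)

# First six annual totals from the output shown earlier
toy <- data.frame(
  year = 2000:2005,
  snow = c(29.7, 21.6, 14.5, 58.1, 25.8, 47.7)
)

# Two ways of drawing bars whose heights come straight from the data
p_bar <- ggplot(toy, aes(x = year, y = snow)) + geom_bar(stat = "identity")
p_col <- ggplot(toy, aes(x = year, y = snow)) + geom_col()

# Both layers compute the same bar heights
heights_bar <- layer_data(p_bar)$y
heights_col <- layer_data(p_col)$y
all.equal(heights_bar, heights_col)
```

Either form works in the plots in this tutorial; geom_col() just saves some typing.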
Adding some finishing touches:
ggplot(data = df, aes(x = year, y = snow)) + geom_bar(stat = 'identity') +
scale_x_continuous(breaks = df$year) + ylim(0,60) +
theme_minimal(base_size = 12) + ggtitle('Annual Snowfall in Central Park') +
theme(axis.text.x = element_text(angle=45, hjust = 1), panel.grid.minor.x = element_blank(), plot.title = element_text(hjust = 0.5)) +
ylab('Total Snowfall (in)') + xlab('Year')
Interactive plots can be useful in presentations and applications.
You can easily make your ggplots interactive with plotly using ggplotly():
library(plotly)
g <- ggplot(data = df, aes(x = year, y = snow)) + geom_bar(stat = 'identity') +
scale_x_continuous(breaks = df$year) + ylim(0,60) +
theme_minimal(base_size = 12) + ggtitle('Annual Snowfall in Central Park') +
theme(axis.text.x = element_text(angle=45, hjust = 1), panel.grid.minor.x = element_blank(), plot.title = element_text(hjust = 0.5)) +
ylab('Total Snowfall (in)') + xlab('Year')
ggplotly(g)
You can also use packages like c3:
library(c3)
df %>%
c3(x = 'year', y = 'snow') %>%
c3_bar() %>%
grid('y')
Or highcharter (note that Highcharts requires a license for commercial use):
library(highcharter)
hchart(df, 'column', hcaes(x = year, y = snow))